NB: The worksheet has been developed and prepared by Maxim Romanov for the course “R for Historical Research” (U Vienna, Spring 2019).

1 Goals

2 Preliminaries

2.1 Libraries

The following are the libraries that we will need for this section. Install those that you do not have yet.

#install.packages(c("tidyverse", "readr", "stringr", "text2vec", "tidytext", "wordcloud", "RColorBrewer", "quanteda", "readtext", "igraph"))

# General ones 
library(tidyverse)
library(readr)
library(RColorBrewer)

# text analysis specific
library(stringr)
library(text2vec)
library(tidytext)
library(wordcloud)
library(quanteda)
library(readtext)
library(igraph)

2.2 “Dispatch”

d1861 <- read.delim("./dispatch/dispatch_1861.tsv", encoding="UTF-8", header=TRUE, quote="")

2.3 Functions in R

Functions are groups of related statements that perform a specific task; they help break a program into smaller, modular chunks. As programs grow larger, functions keep them organized and manageable. Functions also help avoid repetition and make code reusable.

Most programming languages, R included, come with many predefined (built-in) functions. Essentially, every statement that takes arguments in parentheses is a function call. For instance, in the code chunk above, read.delim() is a function that takes four arguments: 1) the filename (or path to a file); 2) the encoding; 3) header=TRUE, which states that the file has a header row; and 4) quote="", which tells R not to treat " as a special character. We can also write our own functions to take care of sets of operations that we tend to repeat again and again.
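As a minimal sketch of how arguments work, the toy function below has one required argument and one argument with a default value. The name greet() and its arguments are made up for illustration and are not used elsewhere in the worksheet.

```r
# Hypothetical toy function: `name` is required, `greeting` has a default value
greet <- function(name, greeting = "Hello") {
  paste0(greeting, ", ", name, "!")
}

greet("Vienna")                              # positional argument, default greeting
greet(greeting = "Servus", name = "Vienna")  # named arguments can come in any order
```

Naming the arguments, as in the read.delim() call above, makes a call easier to read and independent of argument order.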

Later, take a look at this video by one of the key R developers, and check this tutorial.

2.3.1 Simple Function Example: Hypotenuse

(From Wikipedia) In geometry, a hypotenuse is the longest side of a right-angled triangle, the side opposite the right angle. The length of the hypotenuse of a right triangle can be found using the Pythagorean theorem, which states that the square of the length of the hypotenuse equals the sum of the squares of the lengths of the other two sides (catheti). For example, if one of the other sides has a length of 3 (when squared, 9) and the other has a length of 4 (when squared, 16), then their squares add up to 25. The length of the hypotenuse is the square root of 25, that is, 5.

Let’s write a function that takes the lengths of the catheti as arguments and returns the length of the hypotenuse:

hypotenuse <- function(cathetus1, cathetus2) {
  # Pythagorean theorem: c = sqrt(a^2 + b^2)
  result <- sqrt(cathetus1^2 + cathetus2^2)
  print(paste0("In the triangle with catheti of length ",
              cathetus1, " and ", cathetus2, ", the length of hypotenuse is ", result))
  invisible(result) # return the number without printing it a second time
}
hypotenuse(3,4)
## [1] "In the triangle with catheti of length 3 and 4, the length of hypotenuse is 5"
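The function above prints a sentence; a variant that returns the bare number is often easier to reuse in further calculations. The sketch below shows one way to write it. The name hypotenuse_length() is an assumption made for this example.

```r
# Sketch: return the number instead of printing a sentence
hypotenuse_length <- function(cathetus1, cathetus2) {
  sqrt(cathetus1^2 + cathetus2^2)
}

hypotenuse_length(3, 4)               # 5
hypotenuse_length(c(3, 5), c(4, 12))  # works element-wise on vectors: 5 and 13
```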

2.3.2 More complex one: Cleaning Text

Let’s say we want to clean up a text so that it is easier to analyze: 1) convert everything to lower case; 2) remove all non-alphanumeric characters; and 3) make sure that there are no multiple spaces:

clean_up_text = function(x) {
  x %>% 
    str_to_lower %>% # make text lower case
    str_replace_all("[^[:alnum:]]", " ") %>% # remove non-alphanumeric symbols
    str_replace_all("\\s+", " ") # collapse multiple spaces
}
text = "This is a sentence with punctuation, which mentions Vienna, the capital of Austria."
clean_up_text(text)
## [1] "this is a sentence with punctuation which mentions vienna the capital of austria "
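Note the trailing space in the output above: the final period was replaced by a space. If that matters for later steps, one possible variant (a sketch, with the hypothetical name clean_up_text_trimmed()) uses str_squish() from stringr, which collapses repeated whitespace and also trims both ends.

```r
library(stringr)

# Variant of clean_up_text() that also trims leading and trailing spaces
clean_up_text_trimmed <- function(x) {
  x <- str_to_lower(x)                          # make text lower case
  x <- str_replace_all(x, "[^[:alnum:]]", " ")  # replace non-alphanumeric symbols
  str_squish(x)                                 # collapse whitespace, trim both ends
}

cleaned <- clean_up_text_trimmed(c("Vienna, the capital of Austria.", "  Fort   Donelson!  "))
cleaned
```

Like the original function, this variant works on a whole character vector at once, so it can be applied directly to a text column.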

3 Word frequencies and Word clouds

Let’s load all issues of Dispatch from 1862. We can quickly check what types of articles are there in those issues.

library(tidytext)
# quote="" disables quote handling, so quotation marks in the text are read as ordinary characters
d1862 <- read.delim("./dispatch/dispatch_1862.tsv", encoding="UTF-8", header=TRUE, quote="", stringsAsFactors = FALSE)

d1862 %>%
  count(type, sort=T) # count items of each type, most frequent first

We can create subsets of articles based on their types.

articles_d1862 <- d1862 %>%
  filter(type=="article")

advert_d1862 <- d1862 %>%
  filter(type=="advert")

orders_d1862 <- d1862 %>%
  filter(type=="orders")

death_d1862 <- d1862 %>%
  filter(type=="death" | type == "died")

married_d1862 <- d1862 %>%
  filter(type=="married")

Now, let’s tidy them up: to work with this as a tidy dataset, we need to restructure it in the one-token-per-row format, which as we saw earlier is done with the unnest_tokens() function.

test_set <- death_d1862

test_set_tidy <- test_set %>%
  mutate(item_number = cumsum(str_detect(text, regex("^", ignore_case = TRUE)))) %>%
  select(-type) %>%
  unnest_tokens(word, text) %>%
  mutate(word_number = row_number())

test_set_tidy
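To see what the one-token-per-row format looks like on a small scale, here is a sketch with a made-up two-item data frame (the ids and texts are invented, not taken from the Dispatch).

```r
library(dplyr)
library(tidytext)

# Hypothetical two-item mini corpus
toy <- data.frame(id = c("a", "b"),
                  text = c("Died, on Friday morning.", "The funeral will take place."),
                  stringsAsFactors = FALSE)

toy_tidy <- toy %>%
  unnest_tokens(word, text)  # one row per word; lower-cased, punctuation dropped

toy_tidy
```

Each row keeps the id of the item it came from, which is what later makes grouping and counting by item possible.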
# NB: a generic stop word list can remove words that matter for your research question; if possible, use your own list
data("stop_words")

test_set_tidy_clean <- test_set_tidy %>%
  anti_join(stop_words, by="word")

test_set_tidy_clean
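A custom stop word list only needs a column called word to work with anti_join() in the same way. The sketch below uses invented words on both sides; which words belong on such a list depends on your material.

```r
library(dplyr)

# Hypothetical custom stop word list (one column named `word`, as in stop_words)
my_stop_words <- data.frame(word = c("the", "of", "inst"), stringsAsFactors = FALSE)

# Toy tokens standing in for test_set_tidy
toy_words <- data.frame(word = c("the", "funeral", "of", "captain", "smith"),
                        stringsAsFactors = FALSE)

kept <- toy_words %>%
  anti_join(my_stop_words, by = "word")  # keeps funeral, captain, smith

kept
```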

3.1 Word Frequencies

test_set_tidy %>%
  anti_join(stop_words, by="word") %>%
  count(word, sort = TRUE) 

3.2 Wordclouds

library(wordcloud)
library("RColorBrewer")

test_set_tidy_clean <- test_set_tidy %>%
  anti_join(stop_words, by="word") %>%
  count(word, sort=T)

set.seed(1234)
wordcloud(words=test_set_tidy_clean$word, freq=test_set_tidy_clean$n,
          min.freq = 1, rot.per = .25, random.order=FALSE, #scale=c(5,.5),
          max.words=150, colors=brewer.pal(8, "Dark2"))

  1. What can we glean from this word cloud? Create a word cloud for obituaries. What stands out?

your response

test_set <- death_d1862

test_set_tidy <- test_set %>%
  mutate(item_number = cumsum(str_detect(text, regex("^", ignore_case = TRUE)))) %>%
  select(-type) %>%
  unnest_tokens(word, text) %>%
  mutate(word_number = row_number())


test_set_tidy_clean <- test_set_tidy %>%
  anti_join(stop_words, by="word") %>%
  count(word, sort=T)

set.seed(1234)
wordcloud(words=test_set_tidy_clean$word, freq=test_set_tidy_clean$n,
          min.freq = 1, rot.per = .25, random.order=FALSE, #scale=c(5,.5),
          max.words=150, colors=brewer.pal(8, "Dark2"))

For more details on generating word clouds in R, see: http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know.

4 Word Distribution Plots

4.1 Simple — Dispatch

Where exactly certain words appear in a text can be of great value for understanding that text. Let’s try to plot something like that for all the issues of the Dispatch from 1862.

d1862 <- read.delim("./dispatch/dispatch_1862.tsv", encoding="UTF-8", header=TRUE, quote="", stringsAsFactors = FALSE)

test_set <- d1862

test_set$date <- as.Date(test_set$date, format="%Y-%m-%d")

test_set_tidy <- test_set %>%
  mutate(item_number = cumsum(str_detect(text, regex("^", ignore_case = TRUE)))) %>%
  select(-type) %>%
  unnest_tokens(word, text) %>%
  mutate(word_number = row_number())

test_set_tidy
ourWord = "donelson"
word_occurance_vector <- which(test_set_tidy$word == ourWord)

plot(0, type='n', #ann=FALSE,
     xlim=c(1,length(test_set_tidy$word)), ylim=c(0,1),
     main=paste0("Dispersion Plot of `", ourWord, "` in Dispatch (1862)"),
     xlab="Newspaper Time", ylab=ourWord, yaxt="n")
segments(x0=word_occurance_vector, x1=word_occurance_vector, y0=0, y1=2)

# col=rgb(0,0,0,alpha=0.3) -- can be passed to segments() to make the lines more transparent
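The logic of the dispersion plot is easier to follow on a toy example: which() returns the positions of the matching tokens, and segments() draws one vertical line at each position. The token vector below is invented for illustration.

```r
# Toy token vector standing in for test_set_tidy$word
words <- c("fort", "donelson", "has", "fallen", "the", "garrison",
           "at", "donelson", "surrendered")
our_word <- "donelson"

positions <- which(words == our_word)  # indices of the matches: 2 and 8

plot(0, type = "n", xlim = c(1, length(words)), ylim = c(0, 1),
     xlab = "Token position", ylab = our_word, yaxt = "n")
segments(x0 = positions, x1 = positions, y0 = 0, y1 = 1)
```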

4.2 Simple — a Star Wars Example

This kind of plot works better with continuous texts than with newspapers. Let’s take a look at the script of Episode I:

SW_to_DF <- function(path_to_file, episode){
  sw_sentences <- scan(path_to_file, what="character", sep="\n")
  sw_sentences <- as.character(sw_sentences)
  sw_sentences <- gsub("([A-Z]) ([A-Z])", "\\1_\\2", sw_sentences)
  sw_sentences <- gsub("([A-Z])-([A-Z])", "\\1_\\2", sw_sentences)
  sw_sentences <- as.data.frame(cbind(episode, sw_sentences), stringsAsFactors=FALSE)
  colnames(sw_sentences) <- c("episode", "sentences")
  return(sw_sentences)
}

sw1_df <- SW_to_DF("./sw_scripts/sw1.md", "sw1")

sw1_df_tidy <- sw1_df %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(sentences, regex("^#", ignore_case = TRUE))))

sw1_df_tidy <- sw1_df_tidy %>%
  unnest_tokens(word, sentences)
ourWord = "anakin"
word_occurance_vector <- which(sw1_df_tidy$word == ourWord)

#plot(x=word_occurance_vector, type="h", )

plot(0, type='n', #ann=FALSE,
     xlim=c(1,length(sw1_df_tidy$word)), ylim=c(0,1),
     main=paste0("Dispersion Plot of `", ourWord, "` in SW1"),
     xlab="Movie Time", ylab=ourWord, yaxt="n")
segments(x0=word_occurance_vector, x1=word_occurance_vector, y0=0, y1=2)

# col=rgb(0,0,0,alpha=0.3) -- can be passed to segments() to make the lines more transparent

5 Word Distribution Plots: With Frequencies Over Time

For newspapers—and other diachronic corpora—a different approach will work better:

test_set_tidy_freqDay <- test_set_tidy %>%
  anti_join(stop_words, by="word") %>%
  group_by(date) %>%
  count(word) 
  
test_set_tidy_freqDay
# interesting examples:
# deserters, killed,
# donelson (The Battle of Fort Donelson took place in mid-February of 1862),
# manassas (place of the Second Bull Run)
# shiloh (Battle of Shiloh took place in April of 1862)

ourWord = "shiloh" 

test_set_tidy_word <- test_set_tidy_freqDay %>%
  filter(word==ourWord)

plot(x=test_set_tidy_word$date, y=test_set_tidy_word$n, type="l", lty=3, lwd=1,
     main=paste0("Word `", ourWord, "` over time"),
     xlab = "1862 - Dispatch coverage", ylab = "word frequency per day")
segments(x0=test_set_tidy_word$date, x1=test_set_tidy_word$date, y0=0, y1=test_set_tidy_word$n, lty=1, lwd=2)
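One thing to keep in mind: count() only returns days on which the word occurs, so the dotted line connects neighbouring hits across days without any occurrences. If you want those days to show up as zeros, a possible approach (a base R sketch with invented counts) is to merge the counts with a complete sequence of dates.

```r
# Toy per-day counts standing in for test_set_tidy_word (days without hits are absent)
hits <- data.frame(date = as.Date(c("1862-04-08", "1862-04-10", "1862-04-11")),
                   n = c(3, 7, 2))

# Complete sequence of days between the first and the last hit
all_days <- data.frame(date = seq(min(hits$date), max(hits$date), by = "day"))

filled <- merge(all_days, hits, by = "date", all.x = TRUE)
filled$n[is.na(filled$n)] <- 0  # days without the word get frequency 0

filled
```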

6 KWIC: Keywords-in-Context

library(quanteda)
library(readtext)

dispatch1862 <- readtext("./dispatch/dispatch_1862.tsv", text_field = "text", quote="")
dispatch1862corpus <- corpus(dispatch1862)

pattern= can also take vectors (for example, c("soldier*", "troop*")); you can also search for phrases with pattern=phrase("fort donelson"); window= defines how many words will be shown before and after the match.

kwic_test <- kwic(tokens(dispatch1862corpus), pattern = "lincoln") # recent quanteda versions expect a tokens object

kwic_test

Try View(kwic_test) in your console!
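The pattern= and window= options described above can be tried out on a tiny corpus first. The two sentences below are invented for this sketch and are not taken from the Dispatch.

```r
library(quanteda)

# Hypothetical two-sentence corpus
toy_corpus <- corpus(c(d1 = "The troops left Fort Donelson at dawn.",
                       d2 = "Our soldiers reached Fort Donelson; the troops were tired."))
toy_tokens <- tokens(toy_corpus)

# several glob patterns at once, two words of context on each side
kw_words <- kwic(toy_tokens, pattern = c("soldier*", "troop*"), window = 2)
kw_words

# a multi-word expression
kw_phrase <- kwic(toy_tokens, pattern = phrase("fort donelson"), window = 2)
kw_phrase
```

Matching is case-insensitive by default, which is why "Fort Donelson" is found with a lower-case pattern.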

7 Document similarity/distance measures: text2vec library

Document similarity—or distance—measures are valuable for a variety of tasks, such as identification of texts with similar (or the same) content.

prep_fun = function(x) {
  x %>% 
    str_to_lower %>% # make text lower case
    str_replace_all("[^[:alnum:]]", " ") %>% # remove non-alphanumeric symbols
    str_replace_all("\\s+", " ") # collapse multiple spaces
}
d1862 <- read.delim("./dispatch/dispatch_1862.tsv", encoding="UTF-8", header=TRUE, quote="", stringsAsFactors = FALSE)

Let’s just filter it down to some sample that would not take too much time to process. We also need to clean up our texts for better calculations.

sample_d1862 <- d1862 %>%
  filter(type=="advert")

sample_d1862$text <- prep_fun(sample_d1862$text)
# shared vector space
it = itoken(as.vector(sample_d1862$text))
v = create_vocabulary(it) %>%
  prune_vocabulary(term_count_min = 2) # drop terms that occur only once
vectorizer = vocab_vectorizer(v)

prune_vocabulary() is useful when you work with a large corpus: term_count_min= removes low-frequency vocabulary from our vector space and lightens the calculations.
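The effect of pruning is easy to check on a toy corpus. The three short texts below are invented; the calls are the same text2vec functions used above.

```r
library(text2vec)

# Hypothetical mini corpus of three "adverts"
toy_texts <- c("house and lot at auction", "house for sale", "lot at auction")
it_toy <- itoken(toy_texts, tokenizer = word_tokenizer, progressbar = FALSE)

v_full   <- create_vocabulary(it_toy)
v_pruned <- prune_vocabulary(v_full, term_count_min = 2)

nrow(v_full)   # number of distinct terms before pruning
v_pruned$term  # only the terms that occur at least twice remain
```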

Now, we need to create a document-feature matrix:

dtm = create_dtm(it, vectorizer)

The text2vec library can calculate several different kinds of similarities and distances (details: http://text2vec.org/similarity.html), which include:

7.1 Jaccard similarity

jaccardMatrix = sim2(dtm, dtm, method = "jaccard", norm = "none")
jaccardMatrix@Dimnames[[1]] <- as.vector(sample_d1862$id)
jaccardMatrix@Dimnames[[2]] <- as.vector(sample_d1862$id)

Let’s take a look at a small section of our matrix. Can you read it? How should this data look in tidy format?

jaccardMatrix[1:4, 1:2]
## 4 x 2 sparse Matrix of class "dgCMatrix"
##                       1862-04-22_advert_244 1862-04-22_advert_245
## 1862-04-22_advert_244            1.00000000            0.05063291
## 1862-04-22_advert_245            0.05063291            1.00000000
## 1862-04-16_advert_6              0.10588235            0.11764706
## 1862-04-16_advert_7              0.06578947            0.07317073
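To get a feeling for what these numbers mean, Jaccard similarity can be computed by hand: it is the number of distinct words that two texts share, divided by the number of distinct words that occur in either of them. The helper below is a base R sketch with a hypothetical name and invented texts; it is not how text2vec computes the matrix internally.

```r
# Jaccard similarity of two texts, computed on their sets of distinct words
jaccard <- function(a, b) {
  a <- unique(strsplit(a, " ")[[1]])
  b <- unique(strsplit(b, " ")[[1]])
  length(intersect(a, b)) / length(union(a, b))
}

# 3 shared words (house, and, lot) out of 7 distinct words overall
jaccard("house and lot at auction", "house and lot for sale")
```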

Converting a matrix into a proper tidy data frame is a bit tricky. Luckily, the igraph library can be extremely helpful here. We can treat our matrix as a set of edges, where each number is the weight of the given edge. Loading this data into igraph spares us the heavy lifting of conversion: it handles all the complicated reconfiguration of our data and returns a proper data frame that conforms to the principles of tidy data.

The steps are:

  1. convert our initial object from a sparse matrix format into a regular matrix format;
  2. rename rows and columns (we have done this already though);
  3. create igraph object from our regular matrix;
  4. extract edges dataframe.
jaccardMatrix <- as.matrix(jaccardMatrix)

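library(igraph)
# graph_from_adjacency_matrix() is the current name of graph.adjacency()
jaccardNW <- graph_from_adjacency_matrix(jaccardMatrix, mode="undirected", weighted=TRUE)
jaccardNW <- igraph::simplify(jaccardNW) # drop self-loops (the diagonal)
jaccard_sim_df <- igraph::as_data_frame(jaccardNW, what="edges")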

colnames(jaccard_sim_df) <- c("text1", "text2", "jaccardSimilarity")
t_jaccard_sim_df_subset <- jaccard_sim_df %>%
  filter(jaccardSimilarity > 0.49) %>%
  arrange(desc(jaccardSimilarity), .by_group=T)

t_jaccard_sim_df_subset
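For comparison, the same matrix-to-edge-list conversion can be sketched in base R without igraph. The matrix and ids below are invented; upper.tri() selects each pair of texts once and leaves out the diagonal.

```r
# Toy symmetric similarity matrix with hypothetical ids
m <- matrix(c(1.0, 0.5, 0.1,
              0.5, 1.0, 0.8,
              0.1, 0.8, 1.0), nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

idx <- which(upper.tri(m), arr.ind = TRUE)  # each pair once, diagonal excluded
edges <- data.frame(text1 = rownames(m)[idx[, "row"]],
                    text2 = colnames(m)[idx[, "col"]],
                    similarity = m[idx],
                    stringsAsFactors = FALSE)

edges[order(-edges$similarity), ]  # most similar pairs first
```

Unlike the igraph route, this version also keeps pairs with a similarity of 0, so filter them out if you do not need them.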

Let’s compare the texts of two adverts, 1862-09-08_advert_171 and 1862-04-07_advert_207 (a score of 1.0000000 would indicate a complete match):

example <- d1862 %>%
  filter(id=="1862-09-08_advert_171")
paste(example[5])
## [1] "A desirable Framed House and Lot on Mill street, in Adams's Valley, at Auction. -- We will sell upon the premises, on Thursday, the 11th day of September, commencing at 4½ o'clock P. M., a very desirable Framed House and Lot located on Mill street, in Adams's Valley, and near to the new workshops of the Virginia Central Railroad Company. It contains four good rooms, and is particularly adapted for a small sized family. The lot fronts 40 feet and runs back 200 feet. Terms. -- One-third cash; the balance at 4 and 8 months, for negotiable notes, with interest added, and secured by a trust deed. se 5 Jas. M. Taylor & Son, Auct'rs."
example <- d1862 %>%
  filter(id=="1862-04-07_advert_207")
paste(example[5])
## [1] "Desirable Framed House and Lot on Federal street, in Sidney, at Auction. -- We will sell upon the premises, on Tuesday, the 8th day of April, commencing at 4 ½ o'clock P. M., the desirable framed House, on the north side of Elmwood, near to Grove street, now in the occupancy of A. B. Hall. It contains several good rooms, Kitchen, & c., and adapted for a small-sized family; a well of water in the yard. The Lot fronts 30 feet, and runs back 180 feet to an alley 20 feet wide. Terms -- One-third cash; the balance at 6 and 12 months, for negotiable notes, with interest added, and secured by a trust deed. The purchaser to pay the taxes for 1862. James M. Taylor & Son, Auctioneers."

  1. Check http://text2vec.org/similarity.html and calculate cosine and Euclidean distances for the same set of texts. What is the score for the same two texts? How do these scores differ, in your opinion? (Take a look at a few examples with the highest scores!)

7.2 Cosine Similarity

cosineMatrix = sim2(dtm, dtm, method = "cosine", norm = "none")
cosineMatrix@Dimnames[[1]] <- as.vector(sample_d1862$id)
cosineMatrix@Dimnames[[2]] <- as.vector(sample_d1862$id)

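Let’s take a look at a small section of our matrix. Can you read it? Because norm = "none" leaves the rows of dtm unscaled, the values below are raw dot products of term counts rather than cosines between 0 and 1; use norm = "l2" to get proper cosine similarities. How should this data look in tidy format?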

cosineMatrix[1:4, 1:2]
## 4 x 2 sparse Matrix of class "dgCMatrix"
##                       1862-04-22_advert_244 1862-04-22_advert_245
## 1862-04-22_advert_244                   143                    20
## 1862-04-22_advert_245                    20                    32
## 1862-04-16_advert_6                      38                    14
## 1862-04-16_advert_7                      16                     6
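The integers above look like raw dot products of term counts, which is what sim2() returns when norm = "none" is applied to an unscaled matrix. Under that assumption, the cosine similarity of the first two adverts can be recovered by hand: divide their dot product by the product of the two vector lengths, and each length is the square root of the corresponding diagonal value.

```r
# Dot products read off the matrix above
dot_ab <- 20   # advert_244 with advert_245
dot_aa <- 143  # advert_244 with itself
dot_bb <- 32   # advert_245 with itself

# cosine = dot product divided by the product of the two vector lengths
cosine_ab <- dot_ab / (sqrt(dot_aa) * sqrt(dot_bb))
round(cosine_ab, 3)  # 0.296
```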

As in 7.1, we turn the matrix into a tidy data frame with igraph: convert the sparse matrix into a regular matrix, create an igraph object from it, and extract the edges as a data frame.
cosineMatrix <- as.matrix(cosineMatrix)

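library(igraph)
# graph_from_adjacency_matrix() is the current name of graph.adjacency()
cosineNW <- graph_from_adjacency_matrix(cosineMatrix, mode="undirected", weighted=TRUE)
cosineNW <- igraph::simplify(cosineNW) # drop self-loops (the diagonal)
cosine_sim_df <- igraph::as_data_frame(cosineNW, what="edges")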

colnames(cosine_sim_df) <- c("text1", "text2", "cosineSimilarity")
t_cosine_sim_df_subset <- cosine_sim_df %>%
  filter(cosineSimilarity > 0.49) %>%
  arrange(desc(cosineSimilarity), .by_group=T)

t_cosine_sim_df_subset

Let’s compare the texts of the same two adverts, 1862-09-08_advert_171 and 1862-04-07_advert_207:

example <- d1862 %>%
  filter(id=="1862-09-08_advert_171")
paste(example[5])
## [1] "A desirable Framed House and Lot on Mill street, in Adams's Valley, at Auction. -- We will sell upon the premises, on Thursday, the 11th day of September, commencing at 4½ o'clock P. M., a very desirable Framed House and Lot located on Mill street, in Adams's Valley, and near to the new workshops of the Virginia Central Railroad Company. It contains four good rooms, and is particularly adapted for a small sized family. The lot fronts 40 feet and runs back 200 feet. Terms. -- One-third cash; the balance at 4 and 8 months, for negotiable notes, with interest added, and secured by a trust deed. se 5 Jas. M. Taylor & Son, Auct'rs."
example <- d1862 %>%
  filter(id=="1862-04-07_advert_207")
paste(example[5])
## [1] "Desirable Framed House and Lot on Federal street, in Sidney, at Auction. -- We will sell upon the premises, on Tuesday, the 8th day of April, commencing at 4 ½ o'clock P. M., the desirable framed House, on the north side of Elmwood, near to Grove street, now in the occupancy of A. B. Hall. It contains several good rooms, Kitchen, & c., and adapted for a small-sized family; a well of water in the yard. The Lot fronts 30 feet, and runs back 180 feet to an alley 20 feet wide. Terms -- One-third cash; the balance at 6 and 12 months, for negotiable notes, with interest added, and secured by a trust deed. The purchaser to pay the taxes for 1862. James M. Taylor & Son, Auctioneers."

7.3 Euclidean Distance

euclidMatrix = sim2(dtm, dtm, method = "cosine", norm = "none")
euclidMatrix@Dimnames[[1]] <- as.vector(sample_d1862$id)
euclidMatrix@Dimnames[[2]] <- as.vector(sample_d1862$id)

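Let’s take a look at a small section of our matrix. Can you read it? Note that the call above still uses method = "cosine", so the values repeat the dot products from 7.2; Euclidean distances are computed with dist2() and method = "euclidean". How should this data look in tidy format?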

euclidMatrix[1:4, 1:2]
## 4 x 2 sparse Matrix of class "dgCMatrix"
##                       1862-04-22_advert_244 1862-04-22_advert_245
## 1862-04-22_advert_244                   143                    20
## 1862-04-22_advert_245                    20                    32
## 1862-04-16_advert_6                      38                    14
## 1862-04-16_advert_7                      16                     6
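For orientation, this is what Euclidean distance measures: the straight-line distance between two count vectors, where 0 means that the vectors are identical. The sketch below uses base R's dist() on an invented document-term matrix, as a stand-in for text2vec's dist2().

```r
# Toy document-term counts (rows = documents, columns = terms); values are made up
toy_dtm <- matrix(c(2, 0, 1,
                    1, 1, 0,
                    2, 0, 1), nrow = 3, byrow = TRUE,
                  dimnames = list(c("doc1", "doc2", "doc3"),
                                  c("house", "lot", "sale")))

# Euclidean distance between every pair of rows
toy_dist <- as.matrix(dist(toy_dtm, method = "euclidean"))
toy_dist
```

Unlike a similarity, a distance grows as texts become less alike, so a filter for close matches has to look for small values rather than large ones.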

As in 7.1, we turn the matrix into a tidy data frame with igraph: convert the sparse matrix into a regular matrix, create an igraph object from it, and extract the edges as a data frame.
euclidMatrix <- as.matrix(euclidMatrix)

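library(igraph)
# graph_from_adjacency_matrix() is the current name of graph.adjacency()
euclidNW <- graph_from_adjacency_matrix(euclidMatrix, mode="undirected", weighted=TRUE)
euclidNW <- igraph::simplify(euclidNW) # drop self-loops (the diagonal)
euclid_sim_df <- igraph::as_data_frame(euclidNW, what="edges")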

colnames(euclid_sim_df) <- c("text1", "text2", "euclidSimilarity")
t_euclid_sim_df_subset <- euclid_sim_df %>%
  filter(euclidSimilarity > 0.49) %>%
  arrange(desc(euclidSimilarity), .by_group=T)

t_euclid_sim_df_subset

Let’s compare the texts of the same two adverts, 1862-09-08_advert_171 and 1862-04-07_advert_207:

example <- d1862 %>%
  filter(id=="1862-09-08_advert_171")
paste(example[5])
## [1] "A desirable Framed House and Lot on Mill street, in Adams's Valley, at Auction. -- We will sell upon the premises, on Thursday, the 11th day of September, commencing at 4½ o'clock P. M., a very desirable Framed House and Lot located on Mill street, in Adams's Valley, and near to the new workshops of the Virginia Central Railroad Company. It contains four good rooms, and is particularly adapted for a small sized family. The lot fronts 40 feet and runs back 200 feet. Terms. -- One-third cash; the balance at 4 and 8 months, for negotiable notes, with interest added, and secured by a trust deed. se 5 Jas. M. Taylor & Son, Auct'rs."
example <- d1862 %>%
  filter(id=="1862-04-07_advert_207")
paste(example[5])
## [1] "Desirable Framed House and Lot on Federal street, in Sidney, at Auction. -- We will sell upon the premises, on Tuesday, the 8th day of April, commencing at 4 ½ o'clock P. M., the desirable framed House, on the north side of Elmwood, near to Grove street, now in the occupancy of A. B. Hall. It contains several good rooms, Kitchen, & c., and adapted for a small-sized family; a well of water in the yard. The Lot fronts 30 feet, and runs back 180 feet to an alley 20 feet wide. Terms -- One-third cash; the balance at 6 and 12 months, for negotiable notes, with interest added, and secured by a trust deed. The purchaser to pay the taxes for 1862. James M. Taylor & Son, Auctioneers."

8 Homework

Work through Chapter 9 of Arnold, Taylor, and Lauren Tilton. 2015. Humanities Data in R. New York, NY: Springer Science+Business Media. (on Moodle!): create a notebook with all the code discussed there and send it via email (share via DropBox or some other service, if too large).